Abstract

Given the growing number of patents filed in multiple countries, users are interested in retrieving patents across languages. 
We propose a multi-lingual patent retrieval system, which translates a user query into the target language, searches a 
multilingual database for patents relevant to the query, and improves the browsing efficiency by way of machine translation 
and clustering. Our system also extracts new translations from patent families consisting of comparable patents, to enhance 
the translation dictionary. 
1 Introduction

Given the growing number of patents filed in multiple
countries, it is feasible that users are interested in
retrieving patent information across languages. However,
many users find it difficult to perform patent retrieval
(i.e., formulating queries, searching databases for
relevant patents, and browsing retrieved patents) in
foreign languages.

To counter this problem, cross-language information
retrieval (CLIR), where queries in one language are
submitted to retrieve documents in another language, can
be an effective solution. CLIR has of late become one of
the major topics within the information retrieval and
natural language processing communities. In fact, a number
of methods/systems for CLIR have been proposed.

Since by definition queries and documents are in different
languages, queries and documents need to be standardized
into a common representation, so that monolingual
retrieval techniques can be applied. From this point of
view, existing CLIR methods are classified into the
following three fundamental categories.

The first method translates queries into the document
language (Ballesteros and Croft, 199?; Fujii and Ishikawa,
To appear; Nie et al., 1999), and the second method
translates documents into the query language (McCarley,
1999; Oard, 1998). The third method projects both queries
and documents into a language-independent space by way of
thesaurus classes (Gonzalo et al., 1998; Salton, 1970) and
latent semantic indexing (Carbonell et al., 1997; Littman
et al., 1998).

Among those above methods, the first one (i.e., query
translation method) is preferable in terms of
implementation cost, because this approach can simply be
combined with existing monolingual retrieval systems.

Following a query translation method (Fujii and Ishikawa,
1999; Fujii and Ishikawa, To appear), we previously
proposed a Japanese/English cross-language patent
retrieval system (Fukui et al., 2000), where users submit
queries in either Japanese or English to retrieve patents
in the other language. In either case, the target database
is monolingual.

However, since users are not always sure as to which
language database contains patents relevant to their
information need, it is effective to retrieve patents in
multiple languages simultaneously. This process, which we
shall call "multi-lingual information retrieval (MLIR)",
is an extension of CLIR. In this paper, we propose a
Japanese/English multi-lingual patent retrieval system
called "PRIME" (Patent Retrieval In Multi-lingual
Environment).

The design of our system is based on that for technical
documents (Fujii and Ishikawa, 2001), which combines query
translation, document retrieval, document translation and
clustering modules (Section 2).

Additionally, in this paper we newly introduce a module
for enhancing a dictionary used for the query translation
module. For this purpose, we propose a method to extract
Japanese/English translations from patent families
consisting of comparable patents filed in Japan and the
United States (Section 3).

Additionally, in this paper we newly introduce a 
module for enhancing a dictionary used for the query 
translation module. For this purpose, we propose a 
method to extract Japanese/English translations from 
patent families consisting of comparable patents filed 
in Japan and the United States (Section ||). 



2 System Description 

2.1 Overview 

Figure 1 depicts the overall design of PRIME, which
retrieves documents in response to user queries in either
Japanese or English. However, unlike the case of CLIR,
retrieved documents can be in Japanese, in English, or in
a combination of both. We briefly explain the entire
on-line process based on this figure.

First, a user query is translated into the foreign 
language (i.e., either Japanese or English) by way of a 
query translation module. 

Second, a document retrieval module uses both 
the source (user) and translated queries to search a 
Japanese/English bilingual patent collection for rele- 
vant documents. 

In real world usage, Japanese and English patents are not
comparable in the collection (this is the major reason why
cross/multi-lingual retrieval is needed). However, for the
purpose of research and development, we currently target a
comparable collection.

To put it more precisely, the collection contains
approximately 1,750,000 pairs of Japanese abstracts and
their English translations, which were provided on PAJ
(Patent Abstracts of Japan) CD-ROMs in 1995-1999 (see
footnote 1).

Third, among retrieved documents, only those that 
are in the foreign language are translated into the user 
language through a document translation module. 

In principle, we need only the above three modules to
realize multi-lingual patent retrieval in the sense that
users can retrieve/browse foreign documents through their
native language. However, to improve the browsing
efficiency, a clustering module finally divides retrieved
documents into a specific number of groups.

Additionally, in the off-line process, a translation 
extraction module identifies Japanese/English transla- 
tions in the database, to enhance the query translation 
module. 

2.2 Query Translation 

The query translation module is based on the method
proposed by Fujii and Ishikawa (1999; To appear), which
has been applied to Japanese/English CLIR for the NTCIR
collection consisting of technical abstracts (Kando et
al., 1999).

This method translates words and phrases (compound words)
in a given query, maintaining the word order in the source
language. A preliminary study showed that approximately
95% of compound technical terms defined in a bilingual
dictionary (Ferber, 1989) maintain the same word order in
both Japanese and English.

Then, the Nova dictionary (see footnote 2) is used to
derive possible word/phrase translations, and a
probabilistic method is used to resolve translation
ambiguity.

The Nova dictionary includes approximately one 
million Japanese-English translations related to 19 
technical fields as listed below: 

    aeronautics, biotechnology, business, chemistry,
    computers, construction, defense, ecology, electricity,
    energy, finance, law, mathematics, mechanics, medicine,
    metals, oceanography, plants, trade.

[Figure 1 placeholder; recovered labels: Query translation, Dictionary, Translation model, Language model]

Figure 1: The design of PRIME: our multi-lingual patent
retrieval system (dashed arrows denote the off-line
process).

In addition, for words unlisted in the Nova dictionary,
transliteration is performed to identify phonetic
equivalents in the target language. Japanese often
represents loanwords (i.e., technical terms and proper
nouns imported from foreign languages) using its special
phonetic alphabet (or phonogram) called "katakana", with
which new words can be spelled out. Transliteration is
therefore effective in improving the translation quality.

We represent the user query and one translation candidate
in the document language by U and D, respectively. From
the viewpoint of probability theory, our task here is to
select D's with greater probability, P(D|U), which can be
transformed as in Equation (1) through Bayes' theorem.



$$ P(D|U) = \frac{P(U|D) \cdot P(D)}{P(U)} \tag{1} $$



1 Copyright by Japan Patent Office.

2 Developed by NOVA, Inc. http://www.nova.co.jp/



In practice, P(U) can be omitted because this factor
is a constant with respect to the given query, and thus 
does not affect the relative probability for different 
translation candidates. 
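
As a short worked step (our own restatement of the argument above, using only the symbols of Equation (1)), dropping P(U) leaves the ranking of candidates unchanged. The symbol \hat{D} for the selected candidate is introduced here for illustration only.

```latex
\hat{D} = \arg\max_{D} P(D|U)
        = \arg\max_{D} \frac{P(U|D) \cdot P(D)}{P(U)}
        = \arg\max_{D} P(U|D) \cdot P(D)
```

The last equality holds because P(U) takes the same positive value for every candidate D of a given query.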

P(D) is estimated by a word-based bi-gram language model
produced from the target collection. P(U|D) is estimated
based on the word frequency obtained from the Nova
dictionary. Those two factors are commonly termed language
and translation models, respectively (see Figure 1).
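
The ranking of translation candidates by the product of the translation and language models can be sketched as follows. This is a minimal illustration rather than the system's implementation: the function name `rank_translations`, the romanized toy query, all probability values, and the floor value for unseen bi-grams are assumptions made for the example, and the translation model is simplified to independent per-word probabilities.

```python
import math
from itertools import product

def rank_translations(query_terms, dictionary, bigram_prob, floor=1e-6):
    """Rank candidate translations D of query U by log P(U|D) + log P(D)."""
    # dictionary[term] lists (target_word, P(term | target_word)) pairs.
    options = [dictionary[term] for term in query_terms]
    scored = []
    for combo in product(*options):
        # The source word order is kept, as in the method described above.
        words = [word for word, _ in combo]
        # Translation model: independent per-word probabilities.
        log_tm = sum(math.log(p) for _, p in combo)
        # Language model: bi-gram probabilities over the target sequence;
        # unseen bi-grams get a small floor value (an assumption of this sketch).
        log_lm = sum(
            math.log(bigram_prob.get(pair, floor))
            for pair in zip(["<s>"] + words, words)
        )
        scored.append((log_tm + log_lm, words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in scored]

# Toy, made-up probabilities (not taken from the Nova dictionary or any corpus).
toy_dictionary = {
    "tokkyo": [("patent", 0.9), ("license", 0.1)],
    "kensaku": [("retrieval", 0.6), ("search", 0.4)],
}
toy_bigrams = {
    ("<s>", "patent"): 0.5,
    ("patent", "retrieval"): 0.4,
    ("patent", "search"): 0.2,
    ("<s>", "license"): 0.1,
}
ranked = rank_translations(["tokkyo", "kensaku"], toy_dictionary, toy_bigrams)
```

Working in log space avoids numerical underflow when queries contain many terms. Enumerating every combination is only practical for short queries; a real system would presumably prune the candidate space.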

2.3 Document Retrieval 

The retrieval module is based on an existing probabilistic
retrieval method (Robertson and Walker, 1994), which
computes the relevance score between the translated query
and each document in the collection. The relevance score
for document i is computed based on Equation (2).



$$ \sum_{t} \left( \frac{TF_{t,i}}{\frac{DL_i}{avglen} + TF_{t,i}} \cdot \log \frac{N}{DF_t} \right) \tag{2} $$



Here, TF_{t,i} denotes the frequency with which term t
appears in document i. DF_t and N denote the number of
documents containing term t and the total number of
documents in the collection, respectively. DL_i denotes
the length of document i (i.e., the number of characters
contained in i), and avglen denotes the average length of
documents in the collection.
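
The relevance score of Equation (2) can be sketched in a few lines. The sketch assumes the score sums TF_{t,i} / (DL_i / avglen + TF_{t,i}) * log(N / DF_t) over the query terms, and it approximates the character length DL_i by the total length of a document's terms. The function name `relevance_scores` and the toy collection are illustrative only.

```python
import math
from collections import Counter

def relevance_scores(query_terms, docs):
    """Score every document against the query; docs is a list of term lists."""
    n_docs = len(docs)
    # DL_i: approximated here by the number of characters in the indexed terms.
    lengths = [sum(len(term) for term in doc) for doc in docs]
    avglen = sum(lengths) / n_docs
    # DF_t: number of documents containing term t.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc, dl in zip(docs, lengths):
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue  # terms absent from the document contribute nothing
            score += tf[t] / (dl / avglen + tf[t]) * math.log(n_docs / df[t])
        scores.append(score)
    return scores

# Toy collection of already-extracted content words (made up for illustration).
toy_docs = [
    ["patent", "retrieval", "system"],
    ["machine", "translation", "system"],
    ["patent", "family", "corpus"],
]
toy_scores = relevance_scores(["patent", "retrieval"], toy_docs)
```

In this toy run the first document matches both query terms and ranks highest, while the second matches neither and scores zero. For multi-lingual retrieval, the same scoring would be applied with the source and the translated query against the respective language's documents.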

For both Japanese and English collections, we use content
words extracted from documents as terms, and perform a
word-based indexing. For the Japanese collection, we use
the ChaSen morphological analyzer (Matsumoto et al., 1999)
to extract content words. However, for the English
collection, we extract content words based on
parts-of-speech as defined in WordNet (Fellbaum, 1998).



2.4 Document Translation 

The document translation module consists of the Transer
Japanese/English MT system, which uses the same dictionary
used for the query translation module.

In practice, since machine translation is computationally
expensive and degrades the time efficiency, we perform
machine translation on a phrase-by-phrase basis. In brief,
phrases are sequences of content words in documents, for
which we developed rules to generate phrases based on the
part-of-speech information. This method is practical
because even a word/phrase-based translation can improve
the efficiency with which users find relevant foreign
documents in the whole retrieval result (Oard and Resnik,
1999).



2.5 Clustering 

For the purpose of clustering retrieved documents, 
we use the Hierarchical Bayesian Clustering (HBC) 
method (Iwayama and Tokunaga, 1995), which merges 
similar items (i.e., documents in our case) in a bottom- 
up manner, until all the items are merged into a single 
cluster. Thus, a specific number of clusters can be 
obtained by splitting the resultant hierarchy at a pre- 
determined level. 

The HBC method also determines the most representative
item (centroid) for each cluster. Thus, we can enhance the
browsing efficiency by presenting only those centroids to
users.

The similarity between documents is computed based on
feature vectors that characterize each document. In our
case, vectors for each document consist of frequencies of
content words appearing in the document. We extract
content words from documents as performed in word-based
indexing (see Section 2.3).
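
To illustrate bottom-up merging over content-word frequency vectors, the sketch below uses plain agglomerative clustering with cosine similarity. This is a deliberately simpler stand-in and not the HBC method, which selects merges on a Bayesian (posterior probability) criterion. The function names and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(docs, n_clusters):
    """Merge the most similar pair of clusters until n_clusters remain."""
    # Each cluster is (member indices, summed term-frequency vector).
    clusters = [([i], Counter(doc)) for i, doc in enumerate(docs)]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(members) for members, _ in clusters)

# Toy documents given as lists of content words (made up for illustration).
toy_docs = [
    ["engine", "fuel", "engine"],
    ["fuel", "engine", "valve"],
    ["protein", "gene"],
    ["gene", "protein", "cell"],
]
groups = cluster(toy_docs, 2)
```

Stopping the merge loop when the requested number of clusters remains yields the same grouping as building the full hierarchy and cutting it at that number of clusters.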

Given the clustering module, the system can facilitate
interactive retrieval. To put it more precisely, through
the interface, users can discard irrelevant clusters
determined by browsing representative documents, and
re-cluster the remaining documents. By performing this
process recursively, only relevant documents eventually
remain.

3 Extracting Translations Using 
Patent Families 

3.1 Overview 

Since patents are usually associated with new words, it is
crucial to translate out-of-dictionary words. The
transliteration method used in the query translation
module is one solution for this problem (see Section 2.2).



On the other hand, it is also effective to update the
translation dictionary. For this purpose, a number of
methods to extract translations from bilingual
(parallel/comparable) corpora (Smadja et al., 1996;
Yamamoto and Matsumoto, 2000) are applicable. However, it
is considerably expensive to obtain bilingual corpora with
a sufficient volume of alignment information.

To resolve this problem, we use patent families, 
which are patent sets filed for the same/related con- 
tents in multiple countries, as comparable corpora. 
Thus, patents contained in the same family are not 
necessarily parallel, but quite comparable. 

Among a number of ways to apply for patents in 
multiple countries, we focus solely on patents claim- 
ing priority under the Paris Convention, because we 
can easily identify patent families by the identification 
number assigned to each patent. 

In addition, the number of patent families is still 
increasing. Thus, we can easily update a large-scale 
bilingual comparable corpus based on patent families. 
To the best of our knowledge, no previous research has
utilized patent families for extracting translations.

3.2 Methodology 

Since patents are structured with a number of fields 
(e.g., titles, abstracts, and claims), our method first 
identifies corresponding fragments based on the docu- 
ment structure, to improve the extraction accuracy. 

However, structures of paired patents are not always the
same. For example, the number of fields claimed in a
single patent family often varies depending on the
language. Thus, we use only the title and abstract fields,
which are usually parallel in Japanese and English
patents. In other words, unlike the case of most existing
extraction methods, our method does not need
sentence-aligned corpora.



Table 1: Accuracy for translation extraction. 



We use the ChaSen morphological analyzer (Matsumoto et
al., 1999) and the Brill tagger (Brill, 1995) to extract
content words from Japanese and English fragments,
respectively. In addition, we combine more than one word
into phrases, for which we developed rules to generate
phrases based on the part-of-speech information.

We then compute the association score for all the possible
combinations of Japanese/English phrases co-occurring in
the same fragment, and select those with greater score as
the final translations. For this purpose, we use the
weighted Dice coefficient (Yamamoto and Matsumoto, 2000)
as shown in Equation (3).



$$ score(W_j, W_e) = \log F_{je} \cdot \frac{2 F_{je}}{F_j + F_e} \tag{3} $$



Here, W_j and W_e are Japanese and English phrases,
respectively. F_j and F_e denote the frequency with which
W_j and W_e appear in the entire corpus, respectively.
F_{je} denotes the frequency with which W_j and W_e
co-occur in the same fragment. The logarithm factor is
effective in discarding infrequent co-occurrences, which
usually decrease the extraction accuracy.
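
The association score can be sketched as below, assuming the weighted Dice coefficient takes the form log(F_je) * 2 * F_je / (F_j + F_e). Two further assumptions are made for the example: the logarithm is taken to base 2, and frequencies are counted once per fragment. The function name `weighted_dice` and the romanized toy phrases are illustrative only.

```python
import math
from collections import Counter
from itertools import product

def weighted_dice(fragment_pairs):
    """Return {(w_j, w_e): score} for phrases co-occurring in paired fragments."""
    f_j, f_e, f_je = Counter(), Counter(), Counter()
    for ja_phrases, en_phrases in fragment_pairs:
        ja, en = set(ja_phrases), set(en_phrases)
        f_j.update(ja)                  # F_j: fragments containing W_j
        f_e.update(en)                  # F_e: fragments containing W_e
        f_je.update(product(ja, en))    # F_je: fragments containing both
    return {
        (wj, we): math.log2(n) * 2 * n / (f_j[wj] + f_e[we])
        for (wj, we), n in f_je.items()
    }

# Toy paired fragments (romanized Japanese phrases, made up for illustration).
toy_pairs = [
    (["tokkyo", "kensaku"], ["patent", "retrieval"]),
    (["tokkyo", "honyaku"], ["patent", "translation"]),
    (["tokkyo"], ["patent"]),
]
toy_scores = weighted_dice(toy_pairs)
```

In this toy run, the pair ("tokkyo", "patent") co-occurs three times and receives a positive score, whereas every pair that co-occurs only once scores zero because log(1) = 0. This is how the logarithm factor suppresses infrequent co-occurrences.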

3.3 Experimentation 

A preliminary study showed that out of approxi- 
mately 1,750,000 patents filed in Japan (1995-1999), 
approximately 32,000 patents were paired with those 
filed in the United States as patent families. Thus, 
in practice we obtained a bilingual comparable corpus 
consisting of 32,000 Japanese/English pairs. From this
corpus, our method extracted 1,234,347 phrase-based
translations, each of which is to be judged correct or
incorrect.

However, we selected only translations whose association
score was above 1.5, and manually judged their
correctness, because a) judging the entire set of
translations would be considerably expensive, and b)
translations with small association scores are usually
incorrect. The total number of selected translations was
37,669.

We then evaluated the accuracy of our extraction method.
The accuracy is the ratio of the number of correct
translations to the number of translations whose
association score is above a specific threshold. As the
threshold was raised, the accuracy increased while the
number of extracted translations decreased, as shown in
Table 1. According to this table, we could achieve a high
accuracy by limiting the number of translations extracted.
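
As a small worked check of this definition, the accuracy at each threshold can be recomputed from the counts reported in Table 1. The variable and function names are illustrative; the numbers are copied from the table.

```python
# (threshold, translations above threshold, correct translations) from Table 1.
table_rows = [
    (1.5, 37669, 5879),
    (2.0, 24869, 4129),
    (3.0, 4419, 1399),
    (4.0, 962, 564),
    (5.0, 356, 240),
]

def accuracy(total, correct):
    """Accuracy in percent, rounded to one decimal place."""
    return round(100.0 * correct / total, 1)

accuracies = [accuracy(total, correct) for _, total, correct in table_rows]
```

For example, at the threshold of 1.5 the accuracy is 5,879 / 37,669, or about 15.6%.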

We spent only four man-days judging the 37,669
translations and identifying 5,879 correct translations.
In other words, our method made it possible to produce
bilingual lexicons semi-automatically at a trivial cost.



Threshold for Score             1.5     2.0     3.0     4.0     5.0
# of Translations            37,669  24,869   4,419     962     356
# of Correct Translations     5,879   4,129   1,399     564     240
Accuracy (%)                   15.6    16.6    31.7    58.6    67.4



4 Conclusion 

In this paper, we proposed a multi-lingual system for
Japanese/English patent retrieval. For this purpose, we
used a query translation method explored in cross-language
information retrieval (CLIR).

However, unlike the case of CLIR, our system re- 
trieves bilingual patents simultaneously in response to 
a monolingual query. Our system also summarizes re- 
trieved patents by way of machine translation and clus- 
tering to improve the browsing efficiency. 

In addition, our system includes an extraction mod- 
ule which produces new translations from patent fami- 
lies consisting of comparable patents, and updates the 
translation dictionary. 

Future work would include improving existing mod- 
ules in our system, and the application of our frame- 
work to other languages. 